Abstract

We analyze the generalization performance of a student in a model composed of linear perceptrons: a
true teacher, ensemble teachers, and the student. Calculating the generalization error of the student
analytically using statistical mechanics in the framework of on-line learning, we prove that when the
learning rate $\eta < 1$, the larger the number $K$ and the variety of the ensemble teachers are, the smaller
the generalization error is. On the other hand, when $\eta > 1$, the properties are completely reversed. If
the variety of the ensemble teachers is rich enough, the direction cosine between the true teacher and the
student becomes unity in the limit of $\eta \to 0$ and $K \to \infty$.

keywords: ensemble teachers, on-line learning, generalization error, statistical mechanics, learning 
rate 

1 Introduction 

Learning is the inference of the underlying rules that govern data generation from observed data. Observed
data are input-output pairs from a teacher and are called examples. Learning can be roughly classified
into batch learning and on-line learning. In batch learning, the given examples are used repeatedly. In this
paradigm, a student comes to give correct answers after training if the student has adequate degrees of
freedom. However, batch learning requires a long training time and a large memory in which to store many
examples. In on-line learning, by contrast, examples are discarded once they have been used. In this case,
a student cannot give correct answers for all examples used in training. However, there are merits: for
example, a large memory for storing many examples is not necessary, and it is possible to follow a
time-variant teacher.

Recently, we analyzed the generalization performance of ensemble learning in a framework
of on-line learning using a statistical mechanical method. Using the same method, we also analyzed
the generalization performance of a student supervised by a moving teacher that goes around a true
teacher. As a result, it was proven that the generalization error of a student can be smaller than that of
a moving teacher, even if the student only uses examples from the moving teacher. In an actual human
society, a teacher observed by a student does not always present the correct answer. In many cases, the
teacher is learning and continues to change. Therefore, the analysis of such a model is interesting for
considering the analogies between statistical learning theories and an actual human society.

On the other hand, in most cases in an actual human society a student can observe examples from
two or more teachers who differ from each other. Therefore, in this paper we analyze the generalization
performance of such a model and discuss the use of imperfect teachers. That is, we consider a true teacher
and $K$ teachers, called ensemble teachers, who exist around the true teacher. A student uses input-output
pairs from the ensemble teachers in turn or randomly. In this paper, we treat a model in which the true
teacher, the ensemble teachers and the student are all linear perceptrons with noises. We obtain order
parameters and generalization errors analytically in the framework of on-line learning using a statistical
mechanical method. As a result, it is proven that when the student's learning rate $\eta < 1$, the larger the
number $K$ and the variety of the ensemble teachers are, the smaller the student's generalization error
is. On the other hand, when $\eta > 1$, the properties are completely reversed. If the variety of ensemble
teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in
the limit of $\eta \to 0$ and $K \to \infty$.



2 Model 

In this paper, we consider a true teacher, $K$ ensemble teachers and a student. They are all linear
perceptrons with connection weights $A$, $B_k$ and $J$, respectively. Here, $k = 1, \ldots, K$. For simplicity, the
connection weights of the true teacher, the ensemble teachers and the student are simply called the true
teacher, the ensemble teachers and the student, respectively. True teacher $A = (A_1, \ldots, A_N)$, ensemble
teachers $B_k = (B_{k1}, \ldots, B_{kN})$, student $J = (J_1, \ldots, J_N)$ and input $x = (x_1, \ldots, x_N)$ are
$N$-dimensional vectors. Each component $A_i$ of $A$ is drawn from $\mathcal{N}(0, 1)$ independently and fixed, where
$\mathcal{N}(0, 1)$ denotes the Gaussian distribution with a mean of zero and unit variance. Some components $B_{ki}$
are equal to $A_i$ multiplied by $-1$; the others are equal to $A_i$. Which components $B_{ki}$ are equal to $-A_i$ is
independent of the value of $A_i$. Hence, $B_{ki}$ also obeys $\mathcal{N}(0, 1)$. $B_{ki}$ is also fixed. The direction cosine
between $B_k$ and $A$ is $R_{Bk}$ and that between $B_k$ and $B_{k'}$ is $q_{kk'}$. Each component $J_i^0$ of the initial
value $J^0$ of $J$ is drawn from $\mathcal{N}(0, 1)$ independently. The direction cosine between $J$ and $A$ is $R_J$ and
that between $J$ and $B_k$ is $R_{BkJ}$. Each component $x_i$ of $x$ is drawn from $\mathcal{N}(0, 1/N)$ independently. Thus,

$$\langle A_i \rangle = 0, \quad \langle A_i^2 \rangle = 1, \quad \langle B_{ki} \rangle = 0, \quad \langle B_{ki}^2 \rangle = 1,$$
$$\langle J_i^0 \rangle = 0, \quad \langle (J_i^0)^2 \rangle = 1, \quad \langle x_i \rangle = 0, \quad \langle x_i^2 \rangle = \frac{1}{N},$$
$$R_{Bk} = \frac{A \cdot B_k}{\|A\| \|B_k\|}, \quad q_{kk'} = \frac{B_k \cdot B_{k'}}{\|B_k\| \|B_{k'}\|}, \quad R_J = \frac{A \cdot J}{\|A\| \|J\|}, \quad R_{BkJ} = \frac{B_k \cdot J}{\|B_k\| \|J\|},$$

where $\langle \cdot \rangle$ denotes a mean.
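The sign-flip construction of the ensemble teachers can be sketched numerically. The following is a minimal illustration rather than the authors' code: the helper names `direction_cosine` and `make_ensemble_teacher`, the seed, and the dimension are all hypothetical, and the sketch additionally assumes that the flips of different teachers are independent of each other, in which case the direction cosine between two teachers comes out close to $R_{Bk}^2$.

```python
import math
import random


def direction_cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))


def make_ensemble_teacher(true_teacher, r_b, rng):
    """Flip each component's sign with probability (1 - r_b) / 2.

    The flip is decided independently of the component's value, so the
    expected direction cosine with the true teacher is r_b.
    """
    p_flip = (1.0 - r_b) / 2.0
    return [-a if rng.random() < p_flip else a for a in true_teacher]


rng = random.Random(0)   # fixed seed, for illustration only
N = 50000                # a large N stands in for the thermodynamic limit
A = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Assumption: the two teachers are generated with independent flips.
B1 = make_ensemble_teacher(A, 0.7, rng)
B2 = make_ensemble_teacher(A, 0.7, rng)

R_B1 = direction_cosine(A, B1)   # fluctuates around 0.7
q_12 = direction_cosine(B1, B2)  # fluctuates around 0.7 ** 2 = 0.49
```

Correlating the flips of different teachers would raise `q_12` above this value, which is one way to realize the larger values of $q$ considered later.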

Figure 1 illustrates the relationship among true teacher $A$, ensemble teachers $B_k$, student $J$ and
direction cosines $q_{kk'}$, $R_{Bk}$, $R_J$ and $R_{BkJ}$.

In this paper, the thermodynamic limit $N \to \infty$ is also treated. Therefore,

$$\|A\| = \sqrt{N}, \quad \|B_k\| = \sqrt{N}, \quad \|J^0\| = \sqrt{N}, \quad \|x\| = 1. \tag{7}$$

Generally, the norm $\|J\|$ of the student changes as the time step proceeds. Therefore, the ratio $l^m$ of the
norm to $\sqrt{N}$ is introduced and called the length of the student. That is, $\|J^m\| = l^m \sqrt{N}$, where $m$
denotes the time step.

The outputs of the true teacher, the ensemble teachers, and the student are $y^m + n_A^m$, $v_k^m + n_{Bk}^m$ and
$u^m l^m + n_J^m$, respectively. Here,

$$y^m = A \cdot x^m, \tag{8}$$
$$v_k^m = B_k \cdot x^m, \tag{9}$$
$$u^m l^m = J^m \cdot x^m, \tag{10}$$
$$n_A^m \sim \mathcal{N}(0, \sigma_A^2), \tag{11}$$
$$n_{Bk}^m \sim \mathcal{N}(0, \sigma_{Bk}^2), \tag{12}$$
$$n_J^m \sim \mathcal{N}(0, \sigma_J^2). \tag{13}$$

Figure 1: True teacher $A$, ensemble teachers $B_k$ and student $J$. $q_{kk'}$, $R_J$, $R_{Bk}$ and $R_{BkJ}$ are direction
cosines.

That is, the outputs of the true teacher, the ensemble teachers and the student include independent
Gaussian noises with variances of $\sigma_A^2$, $\sigma_{Bk}^2$, and $\sigma_J^2$, respectively. Then, $y^m$, $v_k^m$, and $u^m$ of
Eqs. (8)-(10) obey Gaussian distributions with a mean of zero and unit variance.

Let us define error $\epsilon_{Bk}$ between true teacher $A$ and each member $B_k$ of the ensemble teachers by the
squared error of their outputs:

$$\epsilon_{Bk}^m \equiv \frac{1}{2}\left(y^m + n_A^m - v_k^m - n_{Bk}^m\right)^2. \tag{14}$$

In the same manner, let us define error $\epsilon_{BkJ}$ between each member $B_k$ of the ensemble teachers and
student $J$ by the squared error of their outputs:

$$\epsilon_{BkJ}^m \equiv \frac{1}{2}\left(v_k^m + n_{Bk}^m - u^m l^m - n_J^m\right)^2. \tag{15}$$

Student $J$ adopts the gradient method as a learning rule and, for each update, uses input $x$ and an output
of one of the $K$ ensemble teachers $B_k$, chosen in turn or randomly. That is,

$$J^{m+1} = J^m - \eta \frac{\partial \epsilon_{BkJ}^m}{\partial J^m} \tag{16}$$
$$= J^m + \eta \left(v_k^m + n_{Bk}^m - u^m l^m - n_J^m\right) x^m, \tag{17}$$

where $\eta$ denotes the learning rate of the student and is a constant. In cases where the student
uses the $K$ ensemble teachers in turn, $k = \mathrm{mod}(m, K) + 1$. Here, $\mathrm{mod}(m, K)$ denotes the remainder of $m$
divided by $K$. On the other hand, in random cases, $k$ is a uniform random integer that takes one of
$1, 2, \ldots, K$.
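A single update of Eq. (17) and the two teacher-selection schedules can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `teacher_index` and `update_student` are hypothetical, the teachers here are plain Gaussian vectors (their relation to a true teacher is irrelevant for one step), noises are set to zero, and the index is 0-based, whereas the text uses $k = \mathrm{mod}(m, K) + 1$.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def teacher_index(m, K, in_turn=True, rng=random):
    """0-based index of the ensemble teacher used at time step m."""
    return m % K if in_turn else rng.randrange(K)


def update_student(J, x, teacher_output, eta, student_noise=0.0):
    """One gradient step of Eq. (17): J <- J + eta * (teacher - student) * x."""
    residual = teacher_output - (dot(J, x) + student_noise)
    return [j + eta * residual * xi for j, xi in zip(J, x)]


rng = random.Random(1)   # fixed seed, for illustration only
N, K, eta = 1000, 3, 1.0
teachers = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(K)]
J = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Draw a Gaussian input and rescale it so that |x| = 1 holds exactly,
# which is what Eq. (7) states for the thermodynamic limit.
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
norm = dot(x, x) ** 0.5
x = [xi / norm for xi in x]

k = teacher_index(0, K)              # the in-turn schedule starts with teacher 0
v = dot(teachers[k], x)              # noise-free teacher output
J_new = update_student(J, x, v, eta)
residual_after = v - dot(J_new, x)   # vanishes for eta = 1 when |x| = 1
```

With `eta = 1.0` the student reproduces the teacher's output on the current input exactly after the step; smaller or larger values leave a residual proportional to $1 - \eta$.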

Generalizing the learning rules, Eq. (17) can be expressed as

$$J^{m+1} = J^m + f_k^m x^m \tag{18}$$
$$= J^m + f\left(v_k^m + n_{Bk}^m,\ u^m l^m + n_J^m\right) x^m, \tag{19}$$

where $f$ denotes a function that represents the update amount and is determined by the learning rule.

In addition, let us define error $\epsilon_J$ between true teacher $A$ and student $J$ by the squared error of their
outputs:

$$\epsilon_J^m \equiv \frac{1}{2}\left(y^m + n_A^m - u^m l^m - n_J^m\right)^2. \tag{20}$$

3 Theory



3.1 Generalization error 

One purpose of statistical learning theory is to obtain generalization errors theoretically. Since the
generalization error is the mean of the error with respect to the true teacher over the distribution of new
inputs and noises, the generalization error $\epsilon_{Bkg}$ of each member $B_k$ of the ensemble teachers and
$\epsilon_{Jg}$ of student $J$ are calculated as follows. Superscripts $m$, which represent the time steps, are omitted
for simplicity unless stated otherwise.

$$\epsilon_{Bkg} = \int dx\, dn_A\, dn_{Bk}\, P(x, n_A, n_{Bk})\, \epsilon_{Bk} \tag{21}$$
$$= \int dy\, dv_k\, dn_A\, dn_{Bk}\, P(y, v_k, n_A, n_{Bk})\, \frac{1}{2}\left(y + n_A - v_k - n_{Bk}\right)^2 \tag{22}$$
$$= \frac{1}{2}\left(-2R_{Bk} + 2 + \sigma_A^2 + \sigma_{Bk}^2\right), \tag{23}$$
$$\epsilon_{Jg} = \int dx\, dn_A\, dn_J\, P(x, n_A, n_J)\, \epsilon_J \tag{24}$$
$$= \int dy\, du\, dn_A\, dn_J\, P(y, u, n_A, n_J)\, \frac{1}{2}\left(y + n_A - ul - n_J\right)^2 \tag{25}$$
$$= \frac{1}{2}\left(-2R_J l + l^2 + 1 + \sigma_A^2 + \sigma_J^2\right). \tag{26}$$

Here, the integrations have been executed using the following: $y$, $v_k$ and $u$ obey $\mathcal{N}(0, 1)$. The covariance
between $y$ and $v_k$ is $R_{Bk}$, that between $v_k$ and $u$ is $R_{BkJ}$, and that between $y$ and $u$ is $R_J$. All of
$n_A$, $n_{Bk}$ and $n_J$ are independent of the other random variables.
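The closed forms of Eqs. (23) and (26) are simple enough to evaluate directly. The sketch below only transcribes those two formulas; the function and argument names are hypothetical, and the numerical inputs are the parameter values quoted in the caption of Figure 2 together with the initial conditions of Eqs. (36) and (37).

```python
def teacher_generalization_error(R_B, var_A, var_B):
    """Eq. (23): error of one ensemble teacher relative to the true teacher."""
    return 0.5 * (-2.0 * R_B + 2.0 + var_A + var_B)


def student_generalization_error(R_J, l, var_A, var_J):
    """Eq. (26): error of the student relative to the true teacher."""
    return 0.5 * (-2.0 * R_J * l + l * l + 1.0 + var_A + var_J)


# Values quoted for Figure 2: R_B = 0.7, sigma_A^2 = 0.0, sigma_B^2 = 0.1.
eps_teacher = teacher_generalization_error(0.7, 0.0, 0.1)

# Initial student state of Eqs. (36)-(37): R_J = 0 and l = 1, with sigma_J^2 = 0.2.
eps_student_initial = student_generalization_error(0.0, 1.0, 0.0, 0.2)

# A noise-free student that coincides with the true teacher has zero error.
eps_perfect = student_generalization_error(1.0, 1.0, 0.0, 0.0)
```

Under these values the teacher's error is 0.35 and the untrained student's error is 1.1, so the student has to reduce its error by more than a factor of three before it can outperform a single ensemble teacher.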

3.2 Differential equations for order parameters and their analytical solutions 

To simplify analysis, the following auxiliary order parameters are introduced: 

$$r_J \equiv R_J l, \tag{27}$$
$$r_{BkJ} \equiv R_{BkJ} l. \tag{28}$$

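Simultaneous differential equations in deterministic forms, which describe the dynamical behaviors
of the order parameters, have been obtained based on self-averaging in the thermodynamic limit as follows:

$$\frac{dr_{BkJ}}{dt} = \frac{1}{K} \sum_{k'=1}^{K} \langle f_{k'} v_k \rangle, \tag{29}$$
$$\frac{dr_J}{dt} = \frac{1}{K} \sum_{k=1}^{K} \langle f_k y \rangle, \tag{30}$$
$$\frac{dl}{dt} = \frac{1}{K} \sum_{k=1}^{K} \left( \langle f_k u \rangle + \frac{\langle f_k^2 \rangle}{2l} \right). \tag{31}$$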

Here, dimension $N$ is assumed to be sufficiently larger than the number $K$ of ensemble teachers.
Time $t = m/N$, that is, time step $m$ normalized by dimension $N$. Note that the above differential
equations are identical whether the $K$ ensemble teachers are used in turn or randomly.

Since linear perceptrons are treated in this paper, the sample averages that appear in the above
equations can be easily calculated as follows:

$$\langle f_k u \rangle = \eta \left( \frac{r_{BkJ}}{l} - l \right), \tag{32}$$
$$\langle f_k^2 \rangle = \eta^2 \left( l^2 - 2r_{BkJ} + 1 + \sigma_{Bk}^2 + \sigma_J^2 \right), \tag{33}$$
$$\langle f_k y \rangle = \eta \left( R_{Bk} - r_J \right), \tag{34}$$
$$\langle f_{k'} v_k \rangle = \eta \left( q_{k'k} - r_{BkJ} \right). \tag{35}$$

Since all components $A_i$ and $J_i^0$ of true teacher $A$ and the initial student $J^0$ are drawn from
$\mathcal{N}(0, 1)$ independently and because the thermodynamic limit $N \to \infty$ is also treated, they are orthogonal
to each other in the initial state. That is,

$$R_J = 0 \quad \text{when } t = 0. \tag{36}$$

In addition,

$$l = 1 \quad \text{when } t = 0. \tag{37}$$
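As a worked intermediate step (a sketch derived here from the averages of Eqs. (32) and (33), not a formula quoted from the original), substituting those averages into Eq. (31) and using $d(l^2)/dt = 2l\, dl/dt$ gives a linear equation for $l^2$:

```latex
\begin{align*}
\frac{d(l^2)}{dt} &= 2l\,\frac{dl}{dt}
  = \frac{1}{K}\sum_{k=1}^{K}\left(2l\,\langle f_k u\rangle + \langle f_k^2\rangle\right) \\
 &= \frac{1}{K}\sum_{k=1}^{K}\left[2\eta\,(r_{BkJ} - l^2)
    + \eta^2\left(l^2 - 2r_{BkJ} + 1 + \sigma_{Bk}^2 + \sigma_J^2\right)\right] \\
 &= -\eta(2-\eta)\,l^2 + \frac{1}{K}\sum_{k=1}^{K}\left[2\eta(1-\eta)\,r_{BkJ}
    + \eta^2\left(1 + \sigma_{Bk}^2 + \sigma_J^2\right)\right].
\end{align*}
```

The coefficient $-\eta(2-\eta)$ of $l^2$ is negative only for $0 < \eta < 2$, which is the range in which the length of the student can relax to a finite value.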



By using Eqs. (32)-(37), the simultaneous differential equations, Eqs. (29)-(31), can be solved analytically
as follows:

4 Results and Discussion

In this section, we treat the case where the direction cosines $R_{Bk}$ between the ensemble teachers and the
true teacher, the direction cosines $q_{kk'}$ among the ensemble teachers and the variances $\sigma_{Bk}^2$ of the noises
of the ensemble teachers are uniform. That is,

$$R_{Bk} = R_B, \quad q_{kk'} = q \ (k \neq k'), \quad \sigma_{Bk}^2 = \sigma_B^2.$$

In this case, the analytical solutions are expressed as

The dynamical behaviors of generalization error $\epsilon_{Jg}$ have been analytically obtained by solving Eqs.
(26), (27) and (38)-(47). Figure 2 shows the analytical results and the corresponding simulation results,
where $N = 2000$. In the computer simulations, the $K$ ensemble teachers are used in turn. $\epsilon_{Jg}$ was obtained
by averaging the squared errors for $10^4$ random inputs at each time step. The generalization error
$\epsilon_{Bg}$ of one of the ensemble teachers is also shown. The dynamical behaviors of $R_J$ and $l$ are shown in
Fig. 3.

Figure 2: Dynamical behaviors of generalization error $\epsilon_{Jg}$. Theory and computer simulations. Conditions
other than $q$ are $\eta = 0.3$, $K = 3$, $R_B = 0.7$, $\sigma_A^2 = 0.0$, $\sigma_B^2 = 0.1$ and $\sigma_J^2 = 0.2$.



In these figures, the curves represent theoretical results and the dots represent simulation results.
Conditions other than $q$ are common: $\eta = 0.3$, $K = 3$, $R_B = 0.7$, $\sigma_A^2 = 0.0$, $\sigma_B^2 = 0.1$ and
$\sigma_J^2 = 0.2$. Figure 2 shows that the smaller $q$ is, that is, the richer the variety of the ensemble teachers
is, the smaller the generalization error $\epsilon_{Jg}$ of the student is. In particular, in the cases of $q = 0.6$ and
$q = 0.49$, the generalization error of the student becomes smaller than that of a member of the ensemble
teachers after $t \approx 5$. This means that the student in this model can become more clever than each member
of the ensemble teachers even though the student only uses the input-output pairs of members of the
ensemble teachers. Figure 3 shows that the richer the variety of the ensemble teachers is, the larger the
direction cosine $R_J$ is and the smaller the length $l$ of the student is. The reason why the minimum value
of $q$ in Figs. 2 and 3 is 0.49, the square of $R_B = 0.7$, is described later.

Figure 3: Dynamical behaviors of $R_J$ and $l$. Theory and computer simulations. Conditions other than $q$
are $\eta = 0.3$, $K = 3$, $R_B = 0.7$, $\sigma_A^2 = 0.0$, $\sigma_B^2 = 0.1$ and $\sigma_J^2 = 0.2$.



In Figs. 2 and 3, $\epsilon_{Jg}$, $R_J$ and $l$ almost seem to reach a steady state by $t = 20$. The macroscopic
behaviors for $t \to \infty$ can be understood theoretically since the order parameters have been obtained
analytically. Focusing on the signs of the exponents of the exponential functions in Eqs. (38)-(40), we can
see that $\epsilon_{Jg}$ and $l$ diverge if $\eta < 0$ or $\eta > 2$. The steady state values of $r_{BkJ}$, $r_J$ and $l^2$ in the case of
$0 < \eta < 2$ can be easily obtained by taking $t \to \infty$ in Eqs. (38)-(40) as follows:

$$r_{BkJ} \to q + \frac{1-q}{K}, \tag{48}$$
$$r_J \to R_B, \tag{49}$$
$$l^2 \to \frac{1}{2-\eta} \left( 2(1-\eta) \left( q + \frac{1-q}{K} \right) + \eta \left( 1 + \sigma_B^2 + \sigma_J^2 \right) \right) \tag{50}$$
$$= q + \frac{1-q}{K} + \frac{\eta}{2-\eta} \left( \frac{(1-q)(K-1)}{K} + \sigma_B^2 + \sigma_J^2 \right). \tag{51}$$
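The steady-state values can be cross-checked by integrating the order-parameter equations numerically. The sketch below is an illustration under stated assumptions, not the authors' procedure: it treats the uniform case, uses forward Euler with a hypothetical step size, assumes the teacher-teacher average $\langle f_{k'} v_k \rangle = \eta (q_{k'k} - r_{BkJ})$ with $q_{kk} = 1$ (which follows from the learning rule of Eq. (17)), and starts from $r_{BkJ} = r_J = 0$ and $l = 1$. All function names are hypothetical.

```python
def integrate_order_parameters(eta, K, q, R_B, var_B, var_J, t_end=100.0, dt=0.001):
    """Forward-Euler integration of the uniform-case order-parameter equations.

    State: r_BJ (student/ensemble-teacher overlap), r_J (student/true-teacher
    overlap) and l2 (squared length of the student).
    """
    q_bar = q + (1.0 - q) / K   # mean of q_{k'k} over k', assuming q_kk = 1
    r_BJ, r_J, l2 = 0.0, 0.0, 1.0
    for _ in range(int(round(t_end / dt))):
        d_r_BJ = eta * (q_bar - r_BJ)
        d_r_J = eta * (R_B - r_J)
        # d(l^2)/dt = 2 l <f u> + <f^2>, averaged over the K teachers
        d_l2 = 2.0 * eta * (r_BJ - l2) + eta ** 2 * (
            l2 - 2.0 * r_BJ + 1.0 + var_B + var_J)
        r_BJ += dt * d_r_BJ
        r_J += dt * d_r_J
        l2 += dt * d_l2
    return r_BJ, r_J, l2


def steady_state(eta, K, q, R_B, var_B, var_J):
    """Closed-form limits of Eqs. (48), (49) and (51), valid for 0 < eta < 2."""
    r_BJ = q + (1.0 - q) / K
    l2 = r_BJ + eta / (2.0 - eta) * ((1.0 - q) * (K - 1) / K + var_B + var_J)
    return r_BJ, R_B, l2


# Parameter values quoted for Figures 2 and 3, with q = 0.49.
params = dict(eta=0.3, K=3, q=0.49, R_B=0.7, var_B=0.1, var_J=0.2)
numeric = integrate_order_parameters(**params)
closed_form = steady_state(**params)
```

For these parameters the two results agree to within numerical precision, and rerunning the integrator with `eta` outside $(0, 2)$ makes `l2` grow without bound instead of settling.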



Equations (26), (27) and (48)-(51) show the following: in the case of $\eta = 1$, the steady value of length
$l$ is independent of the number $K$ of teachers and of the direction cosine $q$ among the ensemble teachers.
Therefore, the steady values of generalization error $\epsilon_{Jg}$ and direction cosine $R_J$ are independent of $K$
and $q$ in this case. In the case of $0 < \eta < 1$, the smaller $q$ is or the larger $K$ is, the smaller the steady
values of $l$ and $\epsilon_{Jg}$ are and the larger the steady value of $R_J$ is. In the case of $1 < \eta < 2$, on the contrary,
the smaller $q$ is or the larger $K$ is, the larger the steady values of $l$ and $\epsilon_{Jg}$ are and the smaller the steady
value of $R_J$ is. That is, in the case of $\eta < 1$, the more teachers exist and the richer the variety of teachers
is, the more clever the student can become. On the contrary, in the case of $\eta > 1$, the number of teachers
should be small and the variety of teachers should be low for the student to become clever.

On the right-hand side of Eq. (51), since the second and the third terms are positive, the steady
value of $l$ is larger than $\sqrt{q}$. In addition, since $l \to \sqrt{q}$ in the limit of $\eta \to 0$ and $K \to \infty$, Eqs.
(27) and (49) show $R_J \to R_B/\sqrt{q}$. On the other hand, when $S$ and $T$ are generated independently
under the condition that the direction cosines between $S$ and $P$ and between $T$ and $P$ are both $R_0$,
where $S$, $T$ and $P$ are high-dimensional vectors, the direction cosine between $S$ and $T$ is $q_0 = R_0^2$,
as shown in the appendix. Therefore, if the ensemble teachers have enough variety that they have been
generated independently under the condition that all direction cosines between the ensemble teachers and
the true teacher are $R_B$, then $R_B/\sqrt{q} = 1$, and the direction cosine $R_J$ between the student and the true
teacher approaches unity regardless of the variances of the noises in the limit of $\eta \to 0$ and $K \to \infty$.
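The dependence of the steady state on $K$ and the limiting behavior can be checked by evaluating Eqs. (26), (27), (49) and (51) directly. The following sketch only transcribes those formulas for the uniform case; the function names and the particular values of $\eta$ and $K$ are hypothetical choices for illustration.

```python
import math


def steady_length_squared(eta, K, q, var_B, var_J):
    """Eq. (51): steady-state l^2 in the uniform case, for 0 < eta < 2."""
    return (q + (1.0 - q) / K
            + eta / (2.0 - eta) * ((1.0 - q) * (K - 1) / K + var_B + var_J))


def steady_direction_cosine(eta, K, q, R_B, var_B, var_J):
    """Steady R_J = r_J / l with r_J -> R_B, from Eqs. (27) and (49)."""
    return R_B / math.sqrt(steady_length_squared(eta, K, q, var_B, var_J))


def steady_student_error(eta, K, q, R_B, var_A, var_B, var_J):
    """Steady generalization error from Eq. (26) with r_J = R_B."""
    l2 = steady_length_squared(eta, K, q, var_B, var_J)
    return 0.5 * (-2.0 * R_B + l2 + 1.0 + var_A + var_J)


common = dict(q=0.49, R_B=0.7, var_A=0.0, var_B=0.1, var_J=0.2)
small_eta = [steady_student_error(eta=0.3, K=K, **common) for K in (1, 3, 10)]
large_eta = [steady_student_error(eta=1.5, K=K, **common) for K in (1, 3, 10)]
at_one = [steady_student_error(eta=1.0, K=K, **common) for K in (1, 3, 10)]

# Many teachers, a tiny learning rate, and q = R_B ** 2 (independent teachers).
near_limit = steady_direction_cosine(eta=1e-3, K=10**4, q=0.49, R_B=0.7,
                                     var_B=0.1, var_J=0.2)
```

In this sketch `small_eta` decreases with $K$, `large_eta` increases with $K$, `at_one` is constant, and `near_limit` is close to unity even though both noise variances are nonzero.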

Figures 4-7 show the relationships between learning rate $\eta$ and $\epsilon_{Jg}$, $R_J$. In Figs. 4 and 5, $K = 3$
is fixed. In Figs. 6 and 7, $q = 0.49$ is fixed. Conditions other than $K$ and $q$ are
$\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$ and $R_B = 0.7$. Computer simulations have been executed using
$\eta = 0.3, 0.6, 1.0, 1.4$ and $1.7$. The values at $t = 20$ are plotted for the simulations and are considered
to have already reached a steady state.

These figures show the following: the smaller learning rate $\eta$ is, the smaller generalization error $\epsilon_{Jg}$
is and the larger direction cosine $R_J$ is. Needless to say, when $\eta$ is small, learning is slow. Therefore,
there is a tradeoff between residual generalization error and learning speed. The phase transition in
which $\epsilon_{Jg}$ diverges and $R_J$ becomes zero at $\eta = 2$ is shown. In the case of $\eta < 1$, the larger $K$ is or the
smaller $q$ is, that is, the richer the variety of ensemble teachers is, the smaller $\epsilon_{Jg}$ is and the larger $R_J$
is. On the contrary, the properties are completely reversed in the case of $\eta > 1$.

As described above, the learning properties change dramatically with learning rate $\eta$. It is difficult to
explain the reason qualitatively. Here, we try to explain the reason intuitively by showing the geometrical
meaning of $\eta$. Figures 8(a)-(c) show the updates for $\eta = 0.5$, $\eta = 1$ and $\eta = 2$, respectively. Here, the
noises are ignored for simplicity. Needless to say, teacher $B_k$ itself cannot be observed directly and only
output $v$ can be observed when student $J$ is updated. In addition, since the projections of $J^{m+1}$
onto $x^m$ and of $B_k$ onto $x^m$ are equal in the case of $\eta = 1$, as shown in Fig. 8(b), $\eta = 1$ is a special
condition where the student uses up the information obtained from input $x^m$. In the case of $\eta < 1$, the
update is short. Since in a sense this fact helps balance the information from the ensemble teachers, the
generalization error of the student is improved when the number $K$ of teachers is large and their variety
is rich. On the other hand, the update is excessive when $\eta > 1$. Therefore, the student is shaken or
swung, and its generalization performance worsens when $K$ is large and the variety is rich. In addition,
the reason that learning diverges if $\eta < 0$ or $\eta > 2$ can be understood intuitively from Fig. 8: the distance
$\|(\eta - 1)(v^m - u^m l^m) x^m\|$, measured along the projections onto $x^m$, between student $J^{m+1}$ after the update
and teacher $B_k$ is larger than the distance $\|(v^m - u^m l^m) x^m\|$ between student $J^m$ before the update and
teacher $B_k$ in the case of $\eta < 0$ or $\eta > 2$. Therefore, the learning diverges.

Figure 4: Steady value of generalization error $\epsilon_{Jg}$ in the case of $K = 3$. Theory and computer simulations.
Conditions other than $K$ and $q$ are $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$ and $R_B = 0.7$.

Figure 5: Steady value of direction cosine $R_J$ in the case of $K = 3$. Theory and computer simulations.
Conditions other than $K$ and $q$ are $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$ and $R_B = 0.7$.

Figure 6: Steady value of generalization error $\epsilon_{Jg}$ in the case of $q = 0.49$. Theory and computer
simulations. Conditions other than $K$ and $q$ are $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$ and $R_B = 0.7$.

Figure 7: Steady value of direction cosine $R_J$ in the case of $q = 0.49$. Theory and computer simulations.
Conditions other than $K$ and $q$ are $\sigma_A^2 = \sigma_B^2 = \sigma_J^2 = 0.0$ and $R_B = 0.7$.
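The geometric argument about the learning rate can be written out in one line of algebra. The following is a sketch under the same simplifications as the figure of the geometric meaning of $\eta$ (noises ignored, $\|x^m\| = 1$), derived here from the update rule of Eq. (17) rather than quoted from the original:

```latex
\begin{align*}
J^{m+1}\cdot x^m &= J^m\cdot x^m + \eta\,(v^m - u^m l^m)\,\|x^m\|^2
                  = u^m l^m + \eta\,(v^m - u^m l^m), \\
v^m - J^{m+1}\cdot x^m &= (1-\eta)\,(v^m - u^m l^m), \\
|1-\eta| > 1 \;&\Longleftrightarrow\; \eta < 0 \ \text{or}\ \eta > 2 .
\end{align*}
```

Each update therefore rescales the output residual on the current input by the factor $1 - \eta$: it is removed entirely at $\eta = 1$, shrinks for $0 < \eta < 2$, and grows otherwise.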



Figure 8: Geometric meaning of learning rate $\eta$. Panels: (a) $\eta = 0.5$, (b) $\eta = 1$, (c) $\eta = 2$; each panel
labels the update vector $\eta (v^m - u^m l^m) x^m$.



5 Conclusion 

We analyzed the generalization performance of a student in a model composed of linear perceptrons: a
true teacher, ensemble teachers, and the student. The generalization error of the student was analytically
calculated using statistical mechanics in the framework of on-line learning, proving that when the learning
rate $\eta < 1$, the larger the number $K$ and the variety of the ensemble teachers are, the smaller the
generalization error is. On the other hand, when $\eta > 1$, the properties are completely reversed. If the
variety of ensemble teachers is rich enough, the direction cosine between the true teacher and the student
becomes unity in the limit of $\eta \to 0$ and $K \to \infty$.
A Direction cosine q among ensemble teachers

Let us consider the case where $S$ and $T$ are generated independently, satisfying the condition that the
direction cosines between $S$ and $P$ and between $T$ and $P$ are both $R_0$, as shown in Fig. 9, where $S$, $T$ and
$P$ are $N$-dimensional vectors. In this figure, the inner product of $s$ and $t$ is

$$s \cdot t = \left( S - R_0 \frac{\|S\|}{\|P\|} P \right) \cdot \left( T - R_0 \frac{\|T\|}{\|P\|} P \right) \tag{52}$$
$$= \|S\| \|T\| \left( q_0 - R_0^2 \right), \tag{53}$$

where $s$ and $t$ are the projections of $S$ and $T$, respectively, onto the orthogonal complement $C$ of $P$.
$q_0$ denotes the direction cosine between $S$ and $T$.

Incidentally, if dimension $N$ is large and $S$ and $T$ have been generated independently, $s$ and $t$ should
be orthogonal to each other. Therefore, $q_0 = R_0^2$.
Figure 9: Direction cosine among ensemble teachers